The Pennsylvania State University, Spring 2021 Stat 415-001, Hyebin Song

Hypothesis Testing



Introduction to Hypothesis Testing

Learning objectives

 

Hypothesis testing framework

Hypothesis

We have a parameter space $\Theta$, which can be partitioned into two disjoint subsets $\Theta_0$ and $\Theta_1$, i.e., $\Theta = \Theta_0 \cup \Theta_1$ with $\Theta_0 \cap \Theta_1 = \emptyset$.

$\Theta_0$ is the set that is currently believed to contain the true parameter. The statistical problem of interest is to use the observed data to disprove the current (= null) hypothesis, $H_0: \theta \in \Theta_0$.

 

If there is sufficient evidence to disprove the null hypothesis from the observed data $x_1, \dots, x_n$, we reject the null hypothesis (change of belief: $\theta \in \Theta_0 \to \theta \in \Theta_1$). Otherwise, we do not reject the null hypothesis (we do not make any changes in our current belief on $\theta$).

Sufficient evidence means it is unlikely to see the observed data if the null hypothesis is true. We need to decide how unlikely the data has to be to reject the null hypothesis.

 

Test statistic and rejection region

We use a test statistic and an associated rejection region (= critical region) to determine whether we have sufficient evidence to disprove the null hypothesis. We reject the null hypothesis if the observed test statistic is in the rejection region.

Remark: The decision is random because it is based on the observed value of a test statistic. If we collected another sample, we would get a different observed test statistic value and might make a different decision.

 

Example: Let $X$ equal the breaking strength of a steel bar. Suppose $X$ follows a Normal distribution $N(\mu, \sigma^2)$. If the bar is manufactured by process I, $\mu = 50$. The company recently changed the manufacturing process, and suspects that the breaking strength of a steel bar has increased while the variance of breaking strengths between steel bars remains the same. To test their hypothesis, the company sampled $n$ steel bars and recorded their breaking strengths, obtaining the sample mean $\bar{x}$ of the $n$ measurements. The company claims that the breaking strength of a steel bar increased because the observed sample mean is greater than $50$.

Hypothesis: $H_0: \mu = 50$ vs. $H_1: \mu > 50$.

A test statistic: $\bar{X}$.

An observed test statistic: $\bar{x}$.

A rejection region = $\{\bar{x} : \bar{x} > 50\}$.

Since $\bar{x}$ is in the rejection region, the company rejected the null hypothesis.

 

Remark 1: In general, the form of rejection region is determined by the form of the alternative hypothesis . If

Remark: Clearly, larger values of the cutoff $c$ in a rejection region of the form $\{\bar{x} > c\}$ give a stricter testing procedure. What value should we choose for $c$?

 

Two types of testing errors

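For any fixed rejection region, two types of errors can be made in reaching a decision.

  • Type I error: rejecting $H_0$ when $H_0$ is true.

  • Type II error: not rejecting $H_0$ when $H_1$ is true.

The probability of a Type I error,

$\alpha = P(\text{Type I error}) = P(\text{rejecting } H_0 \text{ when } H_0 \text{ is true})$, is called the significance level of the test.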

 

Example: What is the significance level of the test that the steel bar company used in the previous example? What would be the significance level of the test if the company decided to reject the null hypothesis if the observed test statistic is greater than ?

  1. Test 1 (rejection region : )

    P(rejecting when is true) =

    Since

  2. Test 2 (rejection region: )

    P(rejecting when is true) =

 

Remark: We usually fix the significance level $\alpha$ of the test in advance (usually we let $\alpha = 0.05$), and make a decision rule so that the type I error probability of the test is $\alpha$.

 

Test statistic and p-value

So far, we have discussed how to determine whether we have sufficient evidence to reject the null hypothesis. The degree of sufficiency was determined by the significance level of the test.

For example, in the steel bar example (),

 

The p-value is defined as the smallest significance level of the test with which the null hypothesis can be rejected with the observed data.

For example, in the steel bar example (),

Therefore, p-value is the probability under the null hypothesis of obtaining a test statistic as extreme as the test statistic actually observed.

 

Remark 1: The smaller the p-value becomes, the more compelling is the evidence that the null hypothesis should be rejected.

Remark 2: The decision rule "reject the null hypothesis when the p-value is less than $\alpha$" is a level $\alpha$ test.

 

 

Example: Find the p-value when in the previous steel bar example (). Make a decision about the company's claim "the breaking strength of a steel bar has increased with the new manufacturing process" at the significance level of .

Rejection region:

p-value =

Do not reject the null hypothesis at .

 

Steps to perform a hypothesis testing with the significance level $\alpha$

  1. Write the null and alternative hypotheses.
  2. Pick a good estimator $\hat{\theta}$ of $\theta$. Choose a test statistic $T$ based on $\hat{\theta}$.

    • $\hat{\theta}$ can be chosen as the test statistic.

    • Often, we work with a function of the estimator $\hat{\theta}$ and $\theta_0$ whose distribution under the null is free of unknown parameters. Usually, it is in the form of $T = \dfrac{\hat{\theta} - \theta_0}{\mathrm{SE}(\hat{\theta})}$.

      • $\mathrm{SE}(\hat{\theta})$ is often referred to as a standard error (SE) of the estimator $\hat{\theta}$.
  3. Find the distribution of the test statistic under the null hypothesis (i.e., when $\theta = \theta_0$).

  4. Make a decision based on the observed value $t$ of the test statistic.

    1. Find a rejection region (= critical region) $RR$ such that $P(T \in RR ; \theta = \theta_0) = \alpha$. Reject the null hypothesis if the observed test statistic is in the rejection region. Or,
    2. Reject the null hypothesis if the p-value (associated with $t$) is less than $\alpha$.

 

Duality of confidence intervals with hypothesis tests

Recall that confidence intervals contain plausible values of the parameter of interest $\theta$. Intuitively, if a confidence interval does not contain the hypothesized value $\theta_0$, there is evidence against the null hypothesis $H_0: \theta = \theta_0$.

There indeed exists an equivalence between hypothesis tests and confidence intervals. It can be shown that for the hypotheses $H_0: \theta = \theta_0$ and $H_1: \theta \neq \theta_0$, the test statistic is in the rejection region of a significance level $\alpha$ test if and only if the hypothesized value $\theta_0$ is not in the $100(1-\alpha)$% confidence interval. The decision rule is to reject $H_0$ if $\theta_0$ is not in the $100(1-\alpha)$% confidence interval, and not to reject $H_0$ otherwise.

 

Example: Construct a % confidence interval for , the breaking strength of a steel bar. The breaking strength of each steel bar follows a and the sample mean of steel bars was . Make a decision about the hypothesis that "the breaking strength of a steel bar is different from with the new manufacturing process" at the significance level of .

.

95% CI =

Since 50 is in the 95% CI, we do not reject the null hypothesis at $\alpha = 0.05$.

 

The same duality principle holds for the one-sided hypotheses. For example, for the one-sided hypothesis

$H_0: \theta = \theta_0$ vs. $H_1: \theta < \theta_0$, we would reject the null hypothesis if the largest plausible value suggested by the data is still smaller than the hypothesized value $\theta_0$. For this task, we need to construct a one-sided random interval $(-\infty, U]$ with confidence coefficient $1 - \alpha$ such that $P(\theta \le U) = 1 - \alpha$. The corresponding confidence interval is $(-\infty, u]$. The decision rule is to reject $H_0$ if $u < \theta_0$.

 

Recall that two-sided random intervals for $\theta$ are of the form

 

Example: Construct a one-sided % confidence interval which gives a lower bound for , the breaking strength of a steel bar. The breaking strength of each steel bar follows a and the sample mean of steel bars was . Make a decision about the company's claim "the breaking strength of a steel bar has increased from with the new manufacturing process" at the significance level of .

, .

A 95% CI with a lower bound =

Do not reject the null since

Tests about one mean

Learning objective

 

Setting

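We have a random sample $X_1, \dots, X_n$ from some distribution. We would like to perform a hypothesis test on the population mean $\mu$ of the distribution.

Null and alternative hypothesis:

$H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$ or $H_1: \mu < \mu_0$ or $H_1: \mu \neq \mu_0$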

 

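1. $X_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known.

  1. Hypothesis

    1. $H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$.
    2. $H_0: \mu = \mu_0$ vs. $H_1: \mu < \mu_0$.
    3. $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$.

  2. $\bar{X}$ is a good estimator. Choose a test statistic $Z$:

     $$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

  3. Find the distribution of the test statistic $Z$.

    • Under the null hypothesis $H_0$, $\mu = \mu_0$. Then $\bar{X} \sim N(\mu_0, \sigma^2/n)$, and therefore, $Z \sim N(0, 1)$.

  4. Make a decision based on the observed value $z$ of the test statistic.

     Reject $H_0$ at level $\alpha$ if (1) $z \ge z_\alpha$, (2) $z \le -z_\alpha$, or (3) $|z| \ge z_{\alpha/2}$, where $z_\alpha$ is the upper $\alpha$ quantile of $N(0, 1)$. Equivalently, reject $H_0$ if the p-value (1) $P(Z \ge z)$, (2) $P(Z \le z)$, or (3) $2P(Z \ge |z|)$ is less than $\alpha$.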
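
The known-variance decision step can be sketched in R as follows. This is a minimal illustration working from summary statistics; the helper name `z_test` and the numbers passed to it are hypothetical and are not taken from the notes.

```r
# One-sample z-test from summary statistics (sigma assumed known).
# `z_test` is a hypothetical helper written for this sketch.
z_test <- function(xbar, mu0, sigma, n,
                   alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  z <- (xbar - mu0) / (sigma / sqrt(n))            # observed test statistic
  p_value <- switch(alternative,
    greater   = pnorm(z, lower.tail = FALSE),      # P(Z >= z)
    less      = pnorm(z),                          # P(Z <= z)
    two.sided = 2 * pnorm(abs(z), lower.tail = FALSE)
  )
  list(z = z, p_value = p_value)
}

# Made-up inputs for illustration only
res <- z_test(xbar = 52, mu0 = 50, sigma = 6, n = 16, alternative = "greater")
res$z        # (52 - 50) / (6 / 4)
res$p_value  # compare with the chosen alpha
```

Comparing `res$p_value` with $\alpha$ gives the same decision as checking whether `res$z` exceeds `qnorm(1 - alpha)`.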

     

 

 

Example: Let equal the length of life of a 60-watt light bulb marketed by a certain manufacturer. Assume that the distribution of is . Suppose the market standard life length of a 60-watt bulb is hours. The company wants to test whether the true length of life of a 60-watt light bulb is different from the market standard. A random sample of bulbs is tested until they burn out, yielding a sample mean of hours. Perform a statistical test on behalf of the company at the significance level .

  1. Hypothesis
  1. is a good estimator. Choose a test statistic

  2. Find the distribution of the test statistic under the null hypothesis. Under the null hypothesis , . Then , and therefore, .

  3. The observed test statistic is

 

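2. $X_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ unknown.

  1. Hypothesis

    1. $H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$.
    2. $H_0: \mu = \mu_0$ vs. $H_1: \mu < \mu_0$.
    3. $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$.

  2. $\bar{X}$ is a good estimator. However, note $Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ is not a statistic anymore (because $\sigma$ is unknown). We consider

     $$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

  3. Find the distribution of the test statistic $T$.

    • Under the null hypothesis $H_0$, $\mu = \mu_0$. Then $T$ follows a $t$ distribution with degrees of freedom $n - 1$.

    Lemma: Let $X_1, \dots, X_n$ be i.i.d. $N(\mu, \sigma^2)$. Then, $\dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1)$, i.e., a $t$ distribution with degrees of freedom of $n - 1$.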

  4. Make a decision based on the observed value of the test statistic.
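
A minimal R sketch of the unknown-variance case, computing the $t$ statistic and p-value from summary statistics. The helper `t_test_summary` and its inputs are hypothetical (they are not the numbers of any example in these notes); `pt` is the base R $t$ distribution function.

```r
# One-sample t-test from summary statistics (sigma unknown, normal data assumed).
# `t_test_summary` is a hypothetical helper written for this sketch.
t_test_summary <- function(xbar, s, n, mu0,
                           alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  t_obs <- (xbar - mu0) / (s / sqrt(n))   # observed test statistic
  df <- n - 1                             # degrees of freedom under H0
  p_value <- switch(alternative,
    greater   = pt(t_obs, df, lower.tail = FALSE),
    less      = pt(t_obs, df),
    two.sided = 2 * pt(abs(t_obs), df, lower.tail = FALSE)
  )
  list(t = t_obs, df = df, p_value = p_value)
}

# Made-up inputs for illustration only
res_t <- t_test_summary(xbar = 470, s = 60, n = 25, mu0 = 500,
                        alternative = "less")
res_t$t        # (470 - 500) / (60 / 5) = -2.5
res_t$p_value
```

With raw data in a vector `x`, the built-in `t.test(x, mu = mu0, alternative = "less")` performs the same test.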

     

 

 

 

Example: In attempting to control the strength of the wastes discharged into a nearby river, a paper firm has taken a number of measures. Members of the firm believe that they have reduced the oxygen-consuming power of their wastes from a previous mean of . The observed values of the sample mean and sample standard deviation from sampled measurements were and . Suppose each measurement follows a Normal distribution. Perform a hypothesis testing for a significance level of .

  1. , .
  2. Test statistic:
  3. The distribution of under the null hypothesis is .
  4. The observed test statistic =

 

 

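3. $X_1, \dots, X_n$ from any distribution, large $n$

  1. Hypothesis

    1. $H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$.
    2. $H_0: \mu = \mu_0$ vs. $H_1: \mu < \mu_0$.
    3. $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$.

  2. $\bar{X}$ is a good estimator. We consider

     $$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \ (\sigma \text{ known}) \quad \text{or} \quad Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \ (\sigma \text{ unknown})$$

  3. Find the distribution of the test statistic under the null.

    • By CLT, $\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1)$.

    • Since we have a large sample size, $S \approx \sigma$.

    • Under the null hypothesis, both test statistics approximately follow a standard Normal distribution.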

       

  4. Make a decision based on the observed value of the test statistic.

Summary

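When we have an i.i.d. random sample $X_1, \dots, X_n$,

| Settings | Test statistic |
| --- | --- |
| $X_i \sim N(\mu, \sigma^2)$, $\sigma^2$ known; $H_0: \mu = \mu_0$. | $Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$ |
| $X_i \sim N(\mu, \sigma^2)$, $\sigma^2$ unknown; $H_0: \mu = \mu_0$. | $T = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t(n-1)$ |
| $X_i$ from any distribution, large $n$; $H_0: \mu = \mu_0$. | (approximate) $\dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \approx N(0, 1)$ or $\dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} \approx N(0, 1)$ |

where $S^2 = \dfrac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2$ is a sample variance estimator.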

 

Tests about two means

Learning objective

 

Setting

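We have two samples $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ from two groups, independent or paired. The goal is to test about the difference of population means of two groups, $\mu_X - \mu_Y$.

Null and alternative hypothesis:

$H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X > \mu_Y$ or $H_1: \mu_X < \mu_Y$ or $H_1: \mu_X \neq \mu_Y$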

 

Two independent samples

 

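1. Independent samples, $X_i \sim N(\mu_X, \sigma_X^2)$, $Y_j \sim N(\mu_Y, \sigma_Y^2)$ with $\sigma_X^2, \sigma_Y^2$ known.

  1. Hypothesis

    1. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X > \mu_Y$.
    2. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X < \mu_Y$.
    3. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X \neq \mu_Y$.

  2. $\bar{X} - \bar{Y}$ is a good estimator of $\mu_X - \mu_Y$. Choose a test statistic $Z$:

     $$Z = \frac{\bar{X} - \bar{Y}}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}}$$

  3. Find the distribution of the test statistic $Z$.

    • We have $\bar{X} - \bar{Y} \sim N\left(\mu_X - \mu_Y, \ \sigma_X^2/n + \sigma_Y^2/m\right)$. Under the null hypothesis $H_0$,

      $\mu_X - \mu_Y = 0$, and therefore, $Z \sim N(0, 1)$.

  4. Make a decision based on the observed value of the test statistic.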

     

 

Example: The amount of a certain trace element in blood is known to be normally distributed and vary with a standard deviation of 5 ppm (parts per million) for female donors and 10 ppm for male blood donors. Random samples of 25 female and 25 male donors yield concentration means of 33 and 28 ppm, respectively. A doctor wants to know whether the population means of concentrations of the element are higher for women.

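  1. Hypothesis: $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X > \mu_Y$, where $X$ and $Y$ denote the concentrations for female and male donors, respectively.
  2. $\bar{X} - \bar{Y}$ is a good estimator for $\mu_X - \mu_Y$. Choose a test statistic $Z = \dfrac{\bar{X} - \bar{Y}}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}}$.

  3. Find the distribution of the test statistic under the null hypothesis. Under the null hypothesis, $Z \sim N(0, 1)$.

  4. The observed test statistic is $z = \dfrac{33 - 28}{\sqrt{5^2/25 + 10^2/25}} = \dfrac{5}{\sqrt{5}} \approx 2.236$.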
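
The arithmetic for this example can be checked with a few lines of R. This is a sketch using only the summary statistics stated in the example; the variable names are chosen here for illustration, and female donors are taken as group $X$.

```r
# Summary statistics stated in the example (X = female donors, Y = male donors)
xbar <- 33; ybar <- 28        # sample means (ppm)
sigma_x <- 5; sigma_y <- 10   # known population standard deviations (ppm)
n <- 25; m <- 25              # sample sizes

# Two-sample z statistic under H0: mu_X = mu_Y
z <- (xbar - ybar) / sqrt(sigma_x^2 / n + sigma_y^2 / m)

# One-sided p-value for H1: mu_X > mu_Y
p_value <- pnorm(z, lower.tail = FALSE)

z
p_value
```

The decision then follows by comparing `p_value` with the chosen significance level.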

 

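2. Independent samples, $X_i \sim N(\mu_X, \sigma_X^2)$, $Y_j \sim N(\mu_Y, \sigma_Y^2)$, with unknown variances

  1. Hypothesis

    1. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X > \mu_Y$.
    2. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X < \mu_Y$.
    3. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X \neq \mu_Y$.

  2. $\bar{X} - \bar{Y}$ is a good estimator of $\mu_X - \mu_Y$. The test statistic in the previous setting,

     $$Z = \frac{\bar{X} - \bar{Y}}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}}$$

    is no longer a statistic because $\sigma_X^2, \sigma_Y^2$ are unknown.

    1) $\sigma_X^2 = \sigma_Y^2 = \sigma^2$ unknown

    We replace $\sigma^2$ with a sample pooled variance estimator $S_p^2$ of $\sigma^2$, where $S_p^2 = \dfrac{(n-1)S_X^2 + (m-1)S_Y^2}{n + m - 2}$. We consider the following test statistic

    $$T = \frac{\bar{X} - \bar{Y}}{S_p\sqrt{1/n + 1/m}}$$

    2) $\sigma_X^2 \neq \sigma_Y^2$ unknown, we use the following test statistic instead

    $$T = \frac{\bar{X} - \bar{Y}}{\sqrt{S_X^2/n + S_Y^2/m}}$$

  3. Find the distribution of the test statistic $T$.

    1) $\sigma_X^2 = \sigma_Y^2 = \sigma^2$ unknown

    • We have $\dfrac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{S_p\sqrt{1/n + 1/m}} \sim t(n + m - 2)$. Under the null hypothesis $H_0$, $T \sim t(n + m - 2)$.

    2) $\sigma_X^2 \neq \sigma_Y^2$ unknown

    • We use Welch's approximation and obtain that, under the null hypothesis, $T$ approximately follows $t(r)$ with $r = \dfrac{\left(s_x^2/n + s_y^2/m\right)^2}{\dfrac{(s_x^2/n)^2}{n-1} + \dfrac{(s_y^2/m)^2}{m-1}}$.

  4. Make a decision based on the observed value of the test statistic.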

 

Example: The amount of a certain trace element in human blood is known to be normally distributed. Also, it is known that the variances of this trace element are the same between men and women. Random samples of 25 female and 25 male donors yield concentration means of 33 and 28 ppm, respectively, with a standard deviation of 5 ppm for female donors and 10 ppm for male blood donors. A doctor wants to know whether the population means of concentrations of the element are higher for women.

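  1. Hypothesis: $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X > \mu_Y$, where $X$ and $Y$ denote the concentrations for female and male donors, respectively.
  2. A test statistic $T = \dfrac{\bar{X} - \bar{Y}}{S_p\sqrt{1/n + 1/m}}$, where $S_p^2 = \dfrac{(n-1)S_X^2 + (m-1)S_Y^2}{n + m - 2}$.

  3. Find the distribution of the test statistic under the null hypothesis. Under the null hypothesis, $T \sim t(48)$.

  4. The observed test statistic is $t = \dfrac{33 - 28}{\sqrt{62.5}\sqrt{1/25 + 1/25}} = \dfrac{5}{\sqrt{5}} \approx 2.236$, where $s_p^2 = \dfrac{24 \cdot 5^2 + 24 \cdot 10^2}{48} = 62.5$.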
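
A short R check of the pooled-variance calculation for this example. This is a sketch from the stated summary statistics only; the variable names are illustrative.

```r
# Summary statistics stated in the example (X = female donors, Y = male donors)
xbar <- 33; ybar <- 28   # sample means (ppm)
s_x <- 5; s_y <- 10      # sample standard deviations (ppm)
n <- 25; m <- 25         # sample sizes

# Pooled variance estimator and pooled two-sample t statistic
sp2 <- ((n - 1) * s_x^2 + (m - 1) * s_y^2) / (n + m - 2)
t_obs <- (xbar - ybar) / sqrt(sp2 * (1 / n + 1 / m))
df <- n + m - 2

# One-sided p-value for H1: mu_X > mu_Y
p_value <- pt(t_obs, df = df, lower.tail = FALSE)

sp2
t_obs
p_value
```

The observed statistic matches the known-variance example numerically, but the reference distribution is now $t(48)$ rather than $N(0, 1)$, so the p-value is slightly larger.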

 

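3. Independent samples $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$, unknown distributions (large $n$ and $m$)

  1. Hypothesis

    1. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X > \mu_Y$.
    2. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X < \mu_Y$.
    3. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X \neq \mu_Y$.

  2. $\bar{X} - \bar{Y}$ is a good estimator of $\mu_X - \mu_Y$. Choose a test statistic $Z$:

    $Z = \dfrac{\bar{X} - \bar{Y}}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}}$ when both variances are known or

    $Z = \dfrac{\bar{X} - \bar{Y}}{\sqrt{S_X^2/n + S_Y^2/m}}$ when both variances are unknown.

  3. Find the distribution of the test statistic $Z$.

    From an application of a version of CLT, we have

    $$\frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}} \approx N(0, 1)$$

    and

    $$\frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{S_X^2/n + S_Y^2/m}} \approx N(0, 1)$$

    since $S_X^2 \approx \sigma_X^2$ and $S_Y^2 \approx \sigma_Y^2$ when $n$ and $m$ are sufficiently large. Under the null hypothesis $H_0$, $Z \approx N(0, 1)$.

  4. Make a decision based on the observed value of the test statistic.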

A paired sample

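When $X_i$ and $Y_i$ are paired, we cannot use the previous tests because $X_i$ and $Y_i$ are dependent. As in interval estimation, we consider another random variable $D_i = X_i - Y_i$, which is the difference between $X_i$ and $Y_i$. Note, $\mu_D = E[D_i] = \mu_X - \mu_Y$. Therefore we can test

  1. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X > \mu_Y$ <-> $H_0: \mu_D = 0$ vs. $H_1: \mu_D > 0$
  2. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X < \mu_Y$ <-> $H_0: \mu_D = 0$ vs. $H_1: \mu_D < 0$
  3. $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X \neq \mu_Y$ <-> $H_0: \mu_D = 0$ vs. $H_1: \mu_D \neq 0$

Assuming $D_i$ follows a normal distribution (or a large sample size), we can use the previous hypothesis testing procedure for one mean to test the hypotheses above.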

 

Example A researcher wants to study whether lack of sleep impacts cognitive performance. The researcher recruited 10 participants. Each participant is asked to take the tests twice: one after a normal sleep and the other after being kept awake for 24 hours.

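| Participant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| First test (normal sleep) | 8.1 | 9.5 | 7.2 | 11.6 | 9.9 | 7.3 | 10 | 10.7 | 10.4 | 8.5 |
| Second test (awake for 24 hours) | 7.0 | 8.6 | 6.3 | 10.7 | 8.8 | 6.3 | 8.9 | 9.1 | 9.0 | 7.5 |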

Suppose it is reasonable to assume that the difference of test scores is normally distributed. The researcher wants to show that lack of sleep decreases cognitive performance.

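  1. Hypothesis: $H_0: \mu_D = 0$ vs. $H_1: \mu_D > 0$, where $D$ is the first test score minus the second test score.
  2. A test statistic $T = \dfrac{\bar{D}}{S_D/\sqrt{n}}$.
  3. Find the distribution of the test statistic under the null hypothesis. Under the null hypothesis, $T \sim t(9)$.
  4. The observed test statistic is $t = \dfrac{1.1}{0.231/\sqrt{10}} \approx 15.06$, using $\bar{d} = 1.1$ and $s_d \approx 0.231$.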
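
The paired analysis can be reproduced in R from the table above. This sketch computes the statistic by hand from the differences and then cross-checks it against base R's `t.test`; the vector names are chosen here for illustration.

```r
# Test scores from the table (same participant order in both vectors)
first  <- c(8.1, 9.5, 7.2, 11.6, 9.9, 7.3, 10, 10.7, 10.4, 8.5)  # normal sleep
second <- c(7.0, 8.6, 6.3, 10.7, 8.8, 6.3, 8.9, 9.1, 9.0, 7.5)   # awake 24 hours

# Paired differences and the one-sample t statistic on the differences
d <- first - second
t_obs <- mean(d) / (sd(d) / sqrt(length(d)))
p_value <- pt(t_obs, df = length(d) - 1, lower.tail = FALSE)  # H1: mu_D > 0

# Same test via the built-in function
res_paired <- t.test(first, second, paired = TRUE, alternative = "greater")

mean(d)
t_obs
p_value
```

The very small p-value reflects how consistent the drop in score is across participants, even though the sample has only 10 pairs.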

 

Summary

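When we have the observed samples $x_1, \dots, x_n$ and $y_1, \dots, y_m$ from two random samples $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$,

| Settings | Test statistic |
| --- | --- |
| $X_i \sim N(\mu_X, \sigma_X^2)$, $Y_j \sim N(\mu_Y, \sigma_Y^2)$, independent, $\sigma_X^2, \sigma_Y^2$ known; $H_0: \mu_X = \mu_Y$. | $Z = \dfrac{\bar{X} - \bar{Y}}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}} \sim N(0, 1)$ |
| $X_i \sim N(\mu_X, \sigma^2)$, $Y_j \sim N(\mu_Y, \sigma^2)$, independent, $\sigma^2$ unknown; $H_0: \mu_X = \mu_Y$. | $T = \dfrac{\bar{X} - \bar{Y}}{S_p\sqrt{1/n + 1/m}} \sim t(n + m - 2)$ |
| $X_i \sim N(\mu_X, \sigma_X^2)$, $Y_j \sim N(\mu_Y, \sigma_Y^2)$, independent, $\sigma_X^2 \neq \sigma_Y^2$ unknown; $H_0: \mu_X = \mu_Y$. | $T = \dfrac{\bar{X} - \bar{Y}}{\sqrt{S_X^2/n + S_Y^2/m}} \approx t(r)$, where $r$ is the df from Welch's approximation |
| Two independent random samples, large $n$ and $m$; $H_0: \mu_X = \mu_Y$. | (approximate) $Z = \dfrac{\bar{X} - \bar{Y}}{\sqrt{S_X^2/n + S_Y^2/m}} \approx N(0, 1)$ |
| Paired samples (dependent, $n = m$), when $D_i \sim N(\mu_D, \sigma_D^2)$, $\sigma_D^2$ unknown; $H_0: \mu_D = 0$. | $T = \dfrac{\bar{D}}{S_D/\sqrt{n}} \sim t(n-1)$ |
| Paired samples (dependent, $n = m$), large $n$; $H_0: \mu_D = 0$. | (approximate) $Z = \dfrac{\bar{D}}{S_D/\sqrt{n}} \approx N(0, 1)$ |

where $S_p^2 = \dfrac{(n-1)S_X^2 + (m-1)S_Y^2}{n + m - 2}$, $D_i = X_i - Y_i$, and $\bar{D}$ and $S_D$ are the sample mean and sample standard deviation of the differences.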

 

Tests about proportions

Learning objective

 

One sample (large n)

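Suppose we have a random sample $X_1, \dots, X_n \sim \text{Bernoulli}(p)$. The goal is to test about the population proportion $p$.

  1. Null and alternative hypothesis:

    1. $H_0: p = p_0$ vs. $H_1: p > p_0$ or $H_1: p < p_0$ or $H_1: p \neq p_0$.
  2. $\hat{p} = \bar{X}$ is a good estimator of $p$. Choose a test statistic $Z$:

     $$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$

  3. By CLT, for $\hat{p}$, we have,

     $$\frac{\hat{p} - p}{\sqrt{p(1 - p)/n}} \approx N(0, 1)$$

    When $p = p_0$ ($H_0$ is true), $Z \approx N(0, 1)$.

  4. Make a decision based on the observed value $z$ of the test statistic $Z$. The decision rules which make an approximate level $\alpha$ test are

    • RR: $z \ge z_\alpha$ or $z \le -z_\alpha$ or $|z| \ge z_{\alpha/2}$

    • p-value: $P(Z \ge z)$ or $P(Z \le z)$ or $2P(Z \ge |z|)$

Remark: In step 2, we may consider,

$$Z_W = \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1 - \hat{p})/n}}$$

Note that under the null hypothesis, this test statistic also approximately follows a standard normal distribution. Therefore, the test statistic $Z_W$ can be used instead of $Z$ in step 4. The test based on $Z$ is called a score test, whereas the test based on $Z_W$ is called a Wald test. There isn't any strong preference between the two tests, although score tests tend to be preferred as they often result in better approximations to a level $\alpha$ test.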
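
To make the difference between the two statistics concrete, here is a minimal R sketch that computes both from a count of successes. The helper name `prop_tests` and the inputs are hypothetical and unrelated to the dice example below; the only difference between the two statistics is whether $p_0$ or $\hat{p}$ is plugged into the standard error.

```r
# Score and Wald statistics for H0: p = p0, given y successes in n trials.
# `prop_tests` is a hypothetical helper written for this sketch.
prop_tests <- function(y, n, p0) {
  p_hat <- y / n
  z_score <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)        # SE evaluated at p0
  z_wald  <- (p_hat - p0) / sqrt(p_hat * (1 - p_hat) / n)  # SE evaluated at p_hat
  c(score = z_score, wald = z_wald)
}

# Made-up inputs for illustration only
z_both <- prop_tests(y = 60, n = 100, p0 = 0.5)
p_two_sided <- 2 * pnorm(abs(z_both), lower.tail = FALSE)

z_both
p_two_sided
```

The two statistics usually lead to the same decision for large $n$, but they can disagree when the observed statistic is close to the critical value.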

 

Example: it was claimed that many commercially manufactured dice are not fair because "spots" are really indentations, so that, for example, the 6-side is lighter than the 1-side. Let equal the probability of rolling a six with one of these dice. To test against , several such dice will be rolled to yield a total of observations. Let equal the number of times that six resulted in the trials. The results of the experiment yielded .

  1. Make a conclusion based on the score test. Use
  2. Make a conclusion based on the Wald test. Use

Score test

  1. against
  2. Under the null, .

We reject the null hypothesis at since

Wald test

  1. against
  2. Under the null, .

We reject the null hypothesis at since

 

One sample (exact)

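What if the sample size is too small to justify the use of CLT? In such a case, we can no longer assume that the distribution of the score test statistic or the Wald test statistic under the null hypothesis is close to $N(0, 1)$. The distribution of $\hat{p} = \frac{1}{n}\sum_{i=1}^n X_i$, where $X_i \sim \text{Bernoulli}(p)$, is not equal to any common distribution that we usually work with. However, we know that the sum of i.i.d. Bernoulli random variables has a Binomial distribution.

  1. Null and alternative hypothesis:

    1. $H_0: p = p_0$ vs. $H_1: p > p_0$ or $H_1: p < p_0$ or $H_1: p \neq p_0$.
  2. Choose a test statistic $Y$:

     $$Y = \sum_{i=1}^n X_i$$

  3. Since $Y$ is a sum of i.i.d. Bernoulli random variables where each $X_i \sim \text{Bernoulli}(p)$, we have $Y \sim \text{Bin}(n, p)$.

    When $p = p_0$ ($H_0$ is true), $Y \sim \text{Bin}(n, p_0)$.

  4. Make a decision based on the observed value $y$ of the test statistic $Y$.

    • RR:

      • $y \ge c$ for $c$ such that $P(Y \ge c ; p = p_0) \le \alpha$.
      • $y \le c$ for $c$ such that $P(Y \le c ; p = p_0) \le \alpha$.
      • $y \le c_1$ or $y \ge c_2$ such that $P(Y \le c_1 ; p = p_0) + P(Y \ge c_2 ; p = p_0) \le \alpha$.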
    • p-value

      • or or

         

Example: Does pineapple belong on a pizza? people out of believe that pineapple belongs on a pizza. Test whether or not using the significance level %, where is the proportion of people who believe that pineapple belongs on a pizza.

  1. and

  2. Test statistic .

  3. Under the null hypothesis, .

  4. We have . Rejection region : or . We need to choose and such that , where .

    | $k$ | $P(Y = k)$ |
    | --- | --- |
    | 0 | 0.0156 |
    | 1 | 0.0938 |
    | 2 | 0.2344 |
    | 3 | 0.3125 |
    | 4 | 0.2344 |
    | 5 | 0.0938 |
    | 6 | 0.0156 |

    Choose , . Then

    Thus choose the rejection region or .

    Since is not in the rejection region, we do not reject the null hypothesis at %.

     

    P-value:

    Since p-value is greater than , we do not reject the null hypothesis at %.

    We have insufficient evidence to prove that the true proportion of people who believe that pineapple belongs on a pizza is different from %.
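
The probability table in this example matches the pmf of a $\text{Bin}(6, 0.5)$ distribution, which can be regenerated in R. The sketch below also compares the sizes of two candidate two-sided rejection regions; the level `alpha = 0.05` is an assumption made here for illustration, and the variable names are not from the notes.

```r
# Null distribution suggested by the table: Y ~ Bin(6, 0.5)
n <- 6; p0 <- 0.5
k <- 0:n
pmf <- dbinom(k, size = n, prob = p0)
data.frame(k = k, prob = round(pmf, 4))

alpha <- 0.05   # assumed significance level for this sketch

# Size (type I error probability) of two candidate two-sided rejection regions
size_outer <- sum(pmf[k == 0 | k == n])        # reject if y = 0 or y = 6
size_wider <- sum(pmf[k <= 1 | k >= n - 1])    # reject if y <= 1 or y >= 5

size_outer
size_wider
```

Because $Y$ is discrete, the achievable sizes jump between these values, so the exact test is typically conservative: its actual type I error probability is strictly below the nominal level.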

 

Two independent samples (large and )

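Suppose we have independent samples, $X_1, \dots, X_n \sim \text{Bernoulli}(p_1)$, $Y_1, \dots, Y_m \sim \text{Bernoulli}(p_2)$. The goal is to test about the difference of population proportions $p_1 - p_2$.

  1. Hypothesis

    1. $H_0: p_1 = p_2$ vs. $H_1: p_1 > p_2$.
    2. $H_0: p_1 = p_2$ vs. $H_1: p_1 < p_2$.
    3. $H_0: p_1 = p_2$ vs. $H_1: p_1 \neq p_2$.

  2. $\hat{p}_1 - \hat{p}_2$ is a good estimator of $p_1 - p_2$. Note under the null, $p_1 = p_2 = p$, and $\text{Var}(\hat{p}_1 - \hat{p}_2) = p(1 - p)\left(\frac{1}{n} + \frac{1}{m}\right)$. Since under the null, both $\hat{p}_1$ and $\hat{p}_2$ estimate $p$, we shall estimate $p$ with the pooled estimator $\hat{p} = \dfrac{\sum_{i=1}^n X_i + \sum_{j=1}^m Y_j}{n + m}$.

  3. By an application of CLT,

     $$\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{p_1(1 - p_1)/n + p_2(1 - p_2)/m}} \approx N(0, 1)$$

    Under the null, $p_1 = p_2 = p$ and $\hat{p} \approx p$ since $n + m$ is large. Therefore, $Z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/n + 1/m)}} \approx N(0, 1)$.

  4. Make a decision based on the observed value $z$ of the test statistic $Z$.

Remark 1: Similarly to the one sample case, the test statistic

$$Z_W = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_1(1 - \hat{p}_1)/n + \hat{p}_2(1 - \hat{p}_2)/m}}$$

can be alternatively used (Wald test).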

 

Test about variances

Learning objective

 

Test about variances

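Suppose we have a random sample $X_1, \dots, X_n \sim N(\mu, \sigma^2)$. The goal is to test about the population variance $\sigma^2$.

  1. Null and alternative hypothesis:

    1. $H_0: \sigma^2 = \sigma_0^2$ vs. $H_1: \sigma^2 > \sigma_0^2$ or $H_1: \sigma^2 < \sigma_0^2$ or $H_1: \sigma^2 \neq \sigma_0^2$.
  2. $S^2$ is a good estimator of $\sigma^2$. We might initially consider the scaled difference $\dfrac{S^2 - \sigma_0^2}{\mathrm{SE}(S^2)}$. However, this measure does not follow any common distribution, and also since $S^2$ is not a mean or sum of i.i.d. random variables, CLT cannot be applied.

    Recall the definition $S^2 = \dfrac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$. From HTZ Theorem 5.5-2, we have,

    $$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1)$$

    That is, $\dfrac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \sim \chi^2(n-1)$.

    We consider the test statistic

    $$\chi^2 = \frac{(n-1)S^2}{\sigma_0^2}$$

  3. Under the null hypothesis, $\chi^2 \sim \chi^2(n-1)$.

  4. Make a decision based on the observed value of the test statistic $\chi^2$. The decision rules which make a level $\alpha$ test are given below, where $\chi^2_\alpha(r)$ denotes the upper $100\alpha$ percent point of $\chi^2(r)$ and $W \sim \chi^2(n-1)$.

    • Rejection Region:

      • $H_1: \sigma^2 > \sigma_0^2$: $\chi^2 \ge \chi^2_{\alpha}(n-1)$
      • $H_1: \sigma^2 < \sigma_0^2$: $\chi^2 \le \chi^2_{1-\alpha}(n-1)$
      • $H_1: \sigma^2 \neq \sigma_0^2$: $\chi^2 \le \chi^2_{1-\alpha/2}(n-1)$ or $\chi^2 \ge \chi^2_{\alpha/2}(n-1)$.
    • p-value:

      • $H_1: \sigma^2 > \sigma_0^2$: $P(W \ge \chi^2)$
      • $H_1: \sigma^2 < \sigma_0^2$: $P(W \le \chi^2)$
      • $H_1: \sigma^2 \neq \sigma_0^2$: $2\min\{P(W \le \chi^2), P(W \ge \chi^2)\}$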

 

Several χ2 Distributions

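In R, $\chi^2_\alpha(r)$ (the upper $100\alpha$ percent point of $\chi^2(r)$) can be obtained by `qchisq(p = 1 - alpha, df = r)`.

Also, the probability that $W \le w$ where $W \sim \chi^2(r)$ can be computed in R via `pchisq(q = w, df = r)`.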

Example: Consider a case in which pills are produced for treating a certain medical condition. It is critical that every pill have close to the recommended amount of the active ingredient because too little could render the pill ineffective and too much could be toxic. Suppose that a well-established manufacturing process produces pills for which the standard deviation of the amount of active ingredient is micrograms. Suppose further that a pharmaceutical company has developed a new process for producing the pills. The company wants to test whether the new process reduces the standard deviation of the amount of active ingredient from . A sample of size is taken, where the sample standard deviation of measurement was . Assume the amount of active ingredient in each pill follows a Normal distribution.

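  1. $H_0: \sigma = \sigma_0$, $H_1: \sigma < \sigma_0$, where $\sigma_0$ is the standard deviation under the established process.

  2. Test statistic: $\chi^2 = \dfrac{(n-1)S^2}{\sigma_0^2}$

  3. The distribution of $\chi^2$ under the null hypothesis is $\chi^2(22)$.

  4. The observed test statistic is $\chi^2 = 10.78$.

    Rejection region: $\chi^2 \le \chi^2_{0.95}(22) = 12.338$

    Since $10.78 \le 12.338$, we reject the null hypothesis.

    or, p-value = $P(W \le 10.78) = 0.022$ where $W \sim \chi^2(22)$.

    ```r
    qchisq(p = 1 - 0.95, df = 22)  # 12.338
    pchisq(q = 10.78, df = 22)     # 0.022
    ```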

 

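Now, suppose we have two independent random samples $X_1, \dots, X_n \sim N(\mu_X, \sigma_X^2)$ and $Y_1, \dots, Y_m \sim N(\mu_Y, \sigma_Y^2)$. The goal is to compare the population variances $\sigma_X^2$ and $\sigma_Y^2$.

  1. Null and alternative hypothesis:

    1. $H_0: \sigma_X^2 = \sigma_Y^2$ vs. $H_1: \sigma_X^2 > \sigma_Y^2$ or $H_1: \sigma_X^2 < \sigma_Y^2$ or $H_1: \sigma_X^2 \neq \sigma_Y^2$.

    Remark: More generally, we can test $H_0: \sigma_X^2 = c\,\sigma_Y^2$ or, equivalently, $H_0: \sigma_X^2/\sigma_Y^2 = c$.

  2. $S_X^2, S_Y^2$ are good estimators of $\sigma_X^2$ and $\sigma_Y^2$. Similarly to the one sample case, $S_X^2 - S_Y^2$ does not follow any common distribution, and also CLT cannot be applied.

    We consider the test statistic

    $$F = \frac{S_X^2}{S_Y^2}$$

  3. Fact: if $U \sim \chi^2(r_1)$ and $V \sim \chi^2(r_2)$ and $U$ and $V$ are independent, then

    $\dfrac{U/r_1}{V/r_2} \sim F(r_1, r_2)$, the F distribution with degrees of freedom $r_1$ and $r_2$.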

     

Several F Distributions

 

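Recall, $\dfrac{(n-1)S_X^2}{\sigma_X^2} \sim \chi^2(n-1)$ and $\dfrac{(m-1)S_Y^2}{\sigma_Y^2} \sim \chi^2(m-1)$. Thus, $\dfrac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} \sim F(n-1, m-1)$, and under the null hypothesis $F = S_X^2/S_Y^2 \sim F(n-1, m-1)$.

  4. Compute the observed test statistic $f = s_x^2/s_y^2$.

    The decision rules which make a level $\alpha$ test are given below, where $F_\alpha(r_1, r_2)$ denotes the upper $100\alpha$ percent point of $F(r_1, r_2)$.

    • Rejection Region:

      • $H_1: \sigma_X^2 > \sigma_Y^2$: $f \ge F_{\alpha}(n-1, m-1)$
      • $H_1: \sigma_X^2 < \sigma_Y^2$: $f \le F_{1-\alpha}(n-1, m-1)$
      • $H_1: \sigma_X^2 \neq \sigma_Y^2$: $f \le F_{1-\alpha/2}(n-1, m-1)$ or $f \ge F_{\alpha/2}(n-1, m-1)$

      Remark: it can be shown that $F_{1-\alpha}(r_1, r_2) = 1/F_{\alpha}(r_2, r_1)$. To see that, note $F_{1-\alpha}(r_1, r_2)$ is the number such that

      $P(F \ge F_{1-\alpha}(r_1, r_2)) = 1 - \alpha$ where $F \sim F(r_1, r_2)$. Since $F = \dfrac{U/r_1}{V/r_2}$ where $U \sim \chi^2(r_1)$, $V \sim \chi^2(r_2)$, independent, $1/F \sim F(r_2, r_1)$. Therefore,

      $$P\left(\frac{1}{F} \ge \frac{1}{F_{1-\alpha}(r_1, r_2)}\right) = 1 - P\left(F \ge F_{1-\alpha}(r_1, r_2)\right) = \alpha$$

      Therefore, $F_{\alpha}(r_2, r_1) = 1/F_{1-\alpha}(r_1, r_2)$.

      Thus, all cases can be written in terms of right-tail critical regions, and we have rejection regions of

      • $H_1: \sigma_X^2 > \sigma_Y^2$: $\dfrac{s_x^2}{s_y^2} \ge F_{\alpha}(n-1, m-1)$

      • $H_1: \sigma_X^2 < \sigma_Y^2$: $\dfrac{s_y^2}{s_x^2} \ge F_{\alpha}(m-1, n-1)$

      • $H_1: \sigma_X^2 \neq \sigma_Y^2$: $\dfrac{s_x^2}{s_y^2} \ge F_{\alpha/2}(n-1, m-1)$ or $\dfrac{s_y^2}{s_x^2} \ge F_{\alpha/2}(m-1, n-1)$

      Similarly, for p-values,

      • $H_1: \sigma_X^2 > \sigma_Y^2$: $P(F \ge f)$ where $F \sim F(n-1, m-1)$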

 

Several F Distributions

 

Example A biologist who studies spiders believes that not only do female green lynx spiders tend to be longer than their male counterparts, but also the lengths of the female spiders seem to vary more than those of the male spiders. We shall test whether this latter belief is true. Suppose that the distribution of the length of male spiders is , the distribution of the length Y of female spiders is , and and are independent. We shall test against the alternative hypothesis . Suppose observations of yielded and while observations of yielded and . Use the significance level of 5%.

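  1. $H_0: \sigma_X^2 = \sigma_Y^2$ and $H_1: \sigma_X^2 < \sigma_Y^2$.
  2. Test statistic $F = S_Y^2/S_X^2$.
  3. The distribution of $F$ under the null is $F(34, 26)$.
  4. The observed test statistic $f = 3.205$.

Rejection region is $f \ge F_{0.05}(34, 26) = 1.879$.

Since $3.205 \ge 1.879$, reject the null hypothesis at $\alpha$ = 5%.

`qf(.05, 34, 26, lower.tail = FALSE)` = 1.879.

p-value = $P(F \ge 3.205) = 0.0016$ where $F \sim F(34, 26)$.

`pf(3.205, 34, 26, lower.tail = FALSE)` = .0016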

 

More examples on calculating Type I and II error probabilities and power of a statistical test

Learning objective

 

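Recall that for a statistical test, there are two types of errors: a Type I error (rejecting $H_0$ when $H_0$ is true) and a Type II error (not rejecting $H_0$ when $H_1$ is true).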

 

Example

Assume that when given a name tag, a person puts it on either the right or left side. Let $p$ equal the probability that the name tag is placed on the right side. We shall test the null hypothesis $H_0: p = 0.5$ against the composite alternative hypothesis $H_1: p < 0.5$. We shall give name tags to a random sample of $n = 10$ people, denoting the placements of their name tags with Bernoulli random variables, $X_1, \dots, X_{10}$, where $X_i = 1$ if a person places the name tag on the right and $X_i = 0$ if a person places the name tag on the left. Suppose $y$ people (out of 10) placed their name tags on the right. Since $Y = \sum_{i=1}^{10} X_i \sim \text{Bin}(10, p)$, for our test statistic, we use $Y$.

 

  1. Choose a rejection region so that the significance level of the test is less than or equal to 5%.
  2. Find the type II error probability when the true $p = 0.2$.
  3. Find the type II error probability when the true $p = 0.01$.

| $y$ | $p = 0.5$ | $p = 0.2$ | $p = 0.01$ |
| --- | --- | --- | --- |
| 0 | 0.0010 | 0.1074 | 0.9044 |
| 1 | 0.0098 | 0.2684 | 0.0914 |
| 2 | 0.0439 | 0.3020 | 0.0042 |
| 3 | 0.1172 | 0.2013 | 0.0001 |
| 4 | 0.2051 | 0.0881 | 0.0000 |
| 5 | 0.2461 | 0.0264 | 0.0000 |
| 6 | 0.2051 | 0.0055 | 0.0000 |
| 7 | 0.1172 | 0.0008 | 0.0000 |
| 8 | 0.0439 | 0.0001 | 0.0000 |
| 9 | 0.0098 | 0.0000 | 0.0000 |
| 10 | 0.0010 | 0.0000 | 0.0000 |

Solution

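  1. $H_0: p = 0.5$ and $H_1: p < 0.5$
  2. Test statistic $Y = \sum_{i=1}^{10} X_i$
  3. Under the null hypothesis, $p = 0.5$, $Y \sim \text{Bin}(10, 0.5)$
  4. Rejection region should be of the form $\{y \le c\}$. Choose $c$ such that the type I error probability under the null is at most 5%. In other words, $P(Y \le c ; p = 0.5) \le 0.05$.

Pmf of $Y$ where $Y \sim \text{Bin}(10, p)$: see the table above.

Choose $c = 1$. $P(Y \le 1 ; p = 0.5) = 0.0010 + 0.0098 = 0.0108$. (Note, $P(Y \le 2 ; p = 0.5) = 0.0547 > 0.05$.) Due to the discrete nature of $Y$, it is not possible to construct a test such that the type I error probability is exactly 5%.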

 

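Definition: the power of a test at the parameter $\theta$ is defined to be the probability of rejecting the null hypothesis when the true parameter value is $\theta$. In other words,

$$\text{power}(\theta) = P(\text{reject } H_0 ; \theta)$$

(If it is clear from the context which test is considered, the power of the test is simply written as power($\theta$). In the book, power($\theta$) is denoted as $K(\theta)$.)

Note, for $\theta \in \Theta_1$, the power of the test at the parameter $\theta$ is 1 minus the type II error probability when the true parameter is $\theta$, since

$$\text{power}(\theta) = P(\text{reject } H_0 ; \theta) = 1 - P(\text{do not reject } H_0 ; \theta)$$

For example, in the name tag example above, with the rejection region $\{y \le 1\}$, power($0.2$) $= P(Y \le 1 ; p = 0.2) = 0.1074 + 0.2684 = 0.3758$, so the type II error probability at $p = 0.2$ is $1 - 0.3758 = 0.6242$.

In fact, we can regard power as a function of the candidate parameter values.

$\text{power}(p) = P(Y \le 1 ; p) = (1 - p)^{10} + 10p(1 - p)^9$.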
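
The power function of the name tag test can be evaluated numerically in R. This sketch assumes the alternative is $p < 0.5$ and the rejection region is $\{y \le 1\}$, which is the 5% cutoff implied by the pmf table ($P(Y \le 1) \approx 0.0108$ and $P(Y \le 2) \approx 0.0547$ under $p = 0.5$); the function name `power_nametag` is hypothetical.

```r
# Power of the test that rejects H0: p = 0.5 when Y <= c, with Y ~ Bin(n, p).
# The cutoff c = 1 and sample size n = 10 are read off the name tag example.
power_nametag <- function(p, n = 10, c = 1) {
  pbinom(c, size = n, prob = p)   # P(Y <= c; p)
}

type1 <- power_nametag(0.5)           # type I error probability under H0
type2_p02 <- 1 - power_nametag(0.2)   # type II error probability at p = 0.2
type2_p001 <- 1 - power_nametag(0.01) # type II error probability at p = 0.01

# Power curve over a grid of candidate values of p
p_grid <- seq(0.01, 0.5, by = 0.01)
power_curve <- power_nametag(p_grid)

type1
type2_p02
type2_p001
```

Plotting `power_curve` against `p_grid` (for example with `plot(p_grid, power_curve, type = "l")`) shows the power dropping toward the type I error probability as $p$ approaches $0.5$.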

 


 

Example Consider the previous name tag example. Suppose a statistician wants to be silly, and consider a silly test: the statistician rejects the null hypothesis 5 % of the times regardless of the observed value of . Find the Type I error probability, and Type II error probability when . Find the power function of the test.

P(Type I error ; ) = 1/20.

P(Type II error ; ) = P(do not reject the null when ) = 19/20.

Power() = 1/20.

Note, this is a valid level $\alpha = 0.05$ test. However, the "power" of the test is terrible.

 

Remark 1: For a significance level $\alpha$ test, the power curve is always at or below $\alpha$ for $\theta \in \Theta_0$ by construction. We desire power($\theta$) to be as large as possible for $\theta \in \Theta_1$ (small risk of committing Type II errors). Among all significance level $\alpha$ tests, clearly, the best test to use is the one that results in the largest power for all $\theta \in \Theta_1$ (a.k.a. the uniformly most powerful (UMP) test).

Two questions: 1. Does a UMP test exist? 2. If so, how can we find the UMP test?

We will learn how to find a UMP test in some special cases. It is out of the scope of this class to provide more general answers to these questions. It is worth mentioning that most of the previously presented tests are optimal in some sense (they are either uniformly most powerful within level $\alpha$ tests or uniformly most powerful within a more restricted class of tests if a UMP test does not exist within level $\alpha$ tests).

Remark 2: Since we are already working with "optimal" tests, for a fixed sample size, both type I and type II error probabilities cannot be made arbitrarily small. For example, given a fixed sample size, we need to increase the type I error probability to increase power. The only way of increasing power without increasing the type I error probability is to increase the sample size.

 


 

Remark 3: For a fixed sample of size $n$, the type II error probability depends on the distance between the true value $\theta$ and the hypothesized value $\theta_0$. If $\theta$ is close to $\theta_0$, the true value of the parameter (either $\theta_0$ or $\theta$) is difficult to detect, and the probability of accepting $H_0$ when $H_1$ is true tends to be large. On the other hand, if $\theta$ is far from $\theta_0$, the true value is relatively easy to detect, and the type II error probability is considerably smaller.

 

Example Let be a random sample of size from the normal distribution , which we can suppose is a possible distribution of scores of students in a statistics course that uses a new method of teaching (e.g., computer-related materials).We wish to decide between (the no-change hypothesis because, let us say, this was the mean score by the previous method of teaching) and the researcher’s hypothesis Let us consider a sample of size and the test statistic .

  1. Choose the rejection region so that the test has the significance level of %.
  2. Find the type II error probability when .
  3. Find the power function of the test.
  4. What is the required sample size for the type II error probability to be less than % if the true parameter value is ?
  1. and

  2. Under the null (i.e., ), . Choose RR to be . Then .

  3. When , Type II error probability = . Need to know the distribution of when .

    Recall, . We have . When , .

    Since ,

  4. power() = P(Reject when ) = .

    When , the distribution of = .

    .

    power() = .

 

  1. When sample size is , . We want to find such that

    We have the distribution of . Then the distribution of when is .

    .

    We need .

    . i.e., .

    Therefore is required.
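
The power and sample size calculations for a one-sided test on a normal mean with known $\sigma$ can be sketched in R as below. The specific numbers (`mu0 = 60`, `sigma = 10`, `n = 25`, alternative value `65`, and both error rates at 5%) are assumptions chosen for illustration and are not claimed to be the values of the example above; `power_z` and `min_n` are hypothetical helper names.

```r
# Power of the level-alpha test of H0: mu = mu0 vs H1: mu > mu0 that rejects
# when the sample mean exceeds mu0 + z_alpha * sigma / sqrt(n).
power_z <- function(mu, mu0, sigma, n, alpha = 0.05) {
  cutoff <- mu0 + qnorm(1 - alpha) * sigma / sqrt(n)
  pnorm(cutoff, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)
}

# Smallest n whose type II error probability at mu1 > mu0 is at most beta,
# from solving (z_alpha + z_beta) * sigma / sqrt(n) <= mu1 - mu0 for n.
min_n <- function(mu1, mu0, sigma, alpha = 0.05, beta = 0.05) {
  ceiling(((qnorm(1 - alpha) + qnorm(1 - beta)) * sigma / (mu1 - mu0))^2)
}

# Illustrative, assumed values
mu0 <- 60; sigma <- 10
power_at_null <- power_z(mu0, mu0, sigma, n = 25)   # equals alpha by construction
power_at_65 <- power_z(65, mu0, sigma, n = 25)
n_req <- min_n(65, mu0, sigma)

power_at_null
power_at_65
n_req
```

Evaluating `power_z` over a grid of `mu` values traces out the power function, and re-running `min_n` with a smaller gap `mu1 - mu0` shows how quickly the required sample size grows when the alternative is close to the null.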